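Expression Language
The expression language described in this document is used to define the trigger condition of an alert. At the highest level, the expression language takes various time series and reduces them to a single number. True or false indicates whether the alert should trigger: 0 represents false (do not trigger the alert) and any other number represents true (trigger the alert). An alert can also produce any number of groups, which define the scope or dimensionality of the alert. For example, you could have one alert per host, per service, or per cluster, or a single alert for the entire environment.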
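Fundamentals
Data Types
Bosun's expression language has three basic data types:
- Scalar: the simplest type. It is a single numeric value with no group associated with it. Note that the empty group, `{}`, is still a group.
- NumberSet: a set of tagged numeric values, with one value per unique group. As a special case, a scalar may be used in place of a numberSet that has a single member with an empty group.
- SeriesSet: a set of series, where each series is an array of timestamp-value pairs with an associated group.
In the vast majority of alerts you will get seriesSets back from your time series database and reduce them to numberSets.
Group Keys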
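Groups are generally provided by your time series database. Groups are also often referred to as tags. When you query the time series database and get multiple time series back, each time series needs an identifier. For example, a query with `host=*` returns one time series per host. Here `host` is the tag key, and the values returned (`host1`, `host2`, `host3`, and so on) are the tag values. The group of a single time series is therefore something like `{host=host1}`. A group can contain multiple tag keys and has one tag value for each key.
Each group can become its own alert instance. This is what we mean by scope or dimensionality. For example, `avg(q("sum:sys.cpu{host=ny-*}", "5m", "")) > 0.8` checks the CPU usage of every host whose name starts with `ny-` at once. The expression language makes it easy to manipulate these dimensions.
Group Subsets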
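Metrics can be combined with operators as long as the group of one is a subset of the group of the other. One group is a subset of another when every tag key-value pair it contains is also present in the other group. The empty group `{}` is a subset of every group. `{host=foo}` is a subset of `{host=foo,interface=eth0}`, while neither `{host=foo,interface=eth0}` nor `{host=foo,partition=/}` is a subset of the other. Equal groups are considered subsets of each other.
Operators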
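The standard arithmetic operators (`+`, `-`, `*`, `/`, `%`), relational operators (`<`, `>`, `==`, `!=`, `>=`, `<=`), and logical operators (`&&`, `||`, `!`) are supported. For example:
- `q("query") + 1`: adds 1 to every element of the result of "query"
- `-q("query")`: negates every element of the result of "query"
- `5 > q("query")`: indicates, for each data point in the result of "query", whether it is less than 5
- `6 / 8`: the scalar result of the arithmetic, 0.75
Series Operations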
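If you combine two seriesSets with an operator (for example `q(..) + q(..)`), the operation is applied to each data point in a series that has a corresponding data point on the right-hand side. A corresponding data point is one with the same timestamp (the normal group subset rules also apply). If a data point on the left-hand side has no corresponding data point on the right-hand side, it is dropped. This feature is new as of 0.5.0.
Precedence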
From highest to lowest:
- `()` and the unary operators `!` and `-`
- `*`, `/`, `%`
- `+`, `-`
- `==`, `!=`, `>`, `>=`, `<`, `<=`
- `&&`
- `||`
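The group subset rule and the timestamp matching used by series operations, both described earlier in this section, can be sketched in Python. This is a minimal illustration and not Bosun's implementation: groups are modeled as plain dicts, a series as a dict from timestamp to value, and the helper names `is_subset` and `add_series` are hypothetical.

```python
def is_subset(a: dict, b: dict) -> bool:
    """Return True if every tag key-value pair of group a also appears in group b."""
    return all(key in b and b[key] == value for key, value in a.items())


def add_series(left: dict, right: dict) -> dict:
    """Add two series point by point, keeping only timestamps present on both sides."""
    return {ts: left[ts] + right[ts] for ts in left if ts in right}


# The empty group is a subset of every group.
print(is_subset({}, {"host": "foo"}))  # True
print(is_subset({"host": "foo"}, {"host": "foo", "interface": "eth0"}))  # True
# Neither of these two groups is a subset of the other.
print(is_subset({"host": "foo", "interface": "eth0"},
                {"host": "foo", "partition": "/"}))  # False

# The point at timestamp 20 has no counterpart on the right, so it is dropped.
print(add_series({10: 1.0, 20: 2.0, 30: 3.0}, {10: 5.0, 30: 7.0}))  # {10: 6.0, 30: 10.0}
```

Equal groups pass the check in both directions, which matches the rule that equal groups are subsets of each other.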
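Numeric Constants
Numbers may be specified in decimal (for example `123.45`), octal (with a leading zero, as in `072`), or hexadecimal (with a leading `0x`, as in `0x2A`). Exponent notation is also supported (for example `-0.8e-2`).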
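As a quick check of the notations above, the same literals can be evaluated with Python's built-in parsers. This only illustrates the values the notations denote; it says nothing about how Bosun's own lexer is implemented.

```python
# Values denoted by the example constants, computed with Python's built-in parsers.
decimal_value = float("123.45")
octal_value = int("072", 8)        # leading zero: base 8, 7*8 + 2
hex_value = int("0x2A", 16)        # leading 0x: base 16, 2*16 + 10
exponent_value = float("-0.8e-2")  # signed mantissa with an exponent

print(decimal_value, octal_value, hex_value, exponent_value)  # 123.45 58 42 -0.008
```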
Anatomy of a Basic Alert
alert haproxy_session_limit {
    template = generic
    $notes = This alert monitors the percentage of sessions against the session limit in haproxy (maxconn) and alerts when we are getting close to that limit and will need to raise that limit. This alert was created due to a socket outage we experienced for that reason
    $current_sessions = max(q("sum:haproxy.frontend.scur{host=*,pxname=*,tier=*}", "5m", ""))
    $session_limit = max(q("sum:haproxy.frontend.slim{host=*,pxname=*,tier=*}", "5m", ""))
    $query = ($current_sessions / $session_limit) * 100
    warn = $query > 80
    crit = $query > 95
    warnNotification = default
    critNotification = default
}
We do not need to understand everything in this alert, but a few points are worth highlighting:
- `haproxy_session_limit` is the name of the alert. An alert instance is uniquely identified by its alert name and its group, for example `haproxy_session_limit{host=lb,pxname=http-in,tier=2}`.
- `$notes` is a variable. Variables are not smart: they are plain text replacement. If you are familiar with macros in C, the concept is similar. Variables can be referenced from notification templates, which is why the notes are defined directly in the alert.
- `q("sum:haproxy.frontend.scur{host=*,pxname=*,tier=*}", "5m", "")` is an OpenTSDB query function. It returns N series, and from the query we know that each series has the `host`, `pxname`, and `tier` tag keys in its group.
- `max(...)` is a reduction function. It takes each series and reduces it to a number (see Data Types above).
- `$current_sessions / $session_limit`: these variables hold numbers whose groups match under the subset rules, so the `/` operator can be applied between them.
- `warn = $query > 80`: if this expression is non-zero (true), the `warnNotification` is triggered.
Query Functions
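Graphite Query Functions (to be continued)
InfluxDB Query Functions (to be continued)
Logstash Query Functions (to be continued)
Elastic Query Functions (to be continued)
OpenTSDB Query Functions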
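Query functions take a query string (for example `sum:os.cpu{host=*}`) and return a seriesSet.
q(query string, startDuration string, endDuration string) seriesSet
Generic query. The time range starts at the current time minus startDuration and ends at the current time minus endDuration. If endDuration is the empty string (`""`), the range ends now. The supported duration strings and the query syntax are described in the OpenTSDB documentation. The query argument is the value part of an `m=...` expression. `*` and `|` are both supported. In addition, queries such as `sys.cpu.user{host=ny-*}` are supported. These are handled by an additional step that determines the valid matches and replaces `ny-*` with `ny-web01|ny-web02|...|ny-web10`, which gives the same result. This lookup is kept in memory and does not incur any additional OpenTSDB API requests, but it does require scollector instances to be sending their data to the bosun server.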
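band(query string, duration string, period string, num scalar) seriesSet
band performs `num` queries of `duration` each, `period` apart, and concatenates the results. The first query ends `period` ago, each following query ends one further `period` back, and every query starts `duration` before its own end. For example, `band("avg:os.cpu", "1h", "1d", 7)` returns a series made up of the given metric from 1 day 1 hour ago to 1 day ago, from 2 days 1 hour ago to 2 days ago, and so on, for 7 days in total. band is a good way to get a time block from a certain hour of the day or a certain day of the week over a long period.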
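The time ranges that band queries can be sketched as follows. This is a minimal illustration of the arithmetic described above, not Bosun's code: the function name `band_windows` is hypothetical, and durations are given directly in seconds rather than as OpenTSDB duration strings.

```python
def band_windows(now: int, duration: int, period: int, num: int) -> list:
    """Return the (start, end) pairs, in seconds, of the num queries band would perform."""
    windows = []
    for i in range(1, num + 1):
        end = now - i * period   # the first window ends one period ago
        start = end - duration   # each window is duration long
        windows.append((start, end))
    return windows


HOUR, DAY = 3600, 86400
now = 10 * DAY  # hypothetical "now", chosen so the numbers are easy to read

# Equivalent of band(..., "1h", "1d", 7): seven one-hour windows, one day apart.
for start, end in band_windows(now, HOUR, DAY, 7):
    print(f"from {(now - start) / HOUR:.0f}h ago to {(now - end) / HOUR:.0f}h ago")
```

The first line printed is `from 25h ago to 24h ago`, which is the "1 day 1 hour ago to 1 day ago" block of the example.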
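over(query string, duration string, period string, num scalar) seriesSet
over takes the same arguments as band. The difference is that over shifts each previous period forward in time so that it lines up with now, tags each period with the duration by which it was shifted, and merges the shifted periods into a single seriesSet. This is useful for time-over-time graphs. For example, to compare the same day week over week: `over("avg:1h-avg:rate:os.cpu{host=ny-bosun01}", "1d", "1w", 4)`.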
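change(query string, startDuration string, endDuration string) numberSet
change determines how much the result of a query changed between startDuration and endDuration. If endDuration is the empty string (`""`), now is used. The query must either be a rate or a counter converted to a rate with the `agg:rate:metric` flag.
For example, assume a metric `net.bytes` that records the number of bytes sent on some interface since boot. We could subtract the value at the start from the value at the end, but if a reboot or a counter rollover occurred during that time, the result would be wrong. Instead, we ask OpenTSDB to convert the metric to a rate and handle that for us. To get the number of bytes sent in the last hour, use: `change("avg:rate:net.bytes", "60m", "")`
Note that change is implemented with bosun's avg function. The following returns the same result as the example above: `avg(q("avg:rate:net.bytes", "60m", "")) * 60 * 60`
Translator's note: change computes the difference between the last data value inside the time range and the maximum value outside the range.
Translator's addition: rate is the increase in the value divided by the number of seconds.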
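The reason for going through a rate can be illustrated with a small Python sketch. It follows the formula given above (average rate multiplied by the number of seconds). The sample data is made up, the function names are hypothetical, and the reset handling (skipping any interval where the counter decreases) is a simplifying assumption, not OpenTSDB's actual rate algorithm.

```python
def rates(samples):
    """Per-second rates between consecutive (timestamp, counter) samples.

    Simplifying assumption: a drop in the counter is treated as a reset
    and that interval is skipped.
    """
    result = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v1 >= v0:
            result.append((v1 - v0) / (t1 - t0))
    return result


def change(samples, seconds):
    """Average rate multiplied by the length of the time range in seconds."""
    r = rates(samples)
    return sum(r) / len(r) * seconds


# Hypothetical counter samples; the counter resets between t=120 and t=180.
samples = [(0, 1000), (60, 1600), (120, 2200), (180, 100), (240, 700)]

naive = samples[-1][1] - samples[0][1]
print(naive)                 # -300, wrong because of the reset
print(change(samples, 240))  # 2400.0
```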
count(query string, startDuration string, endDuration string) scalar
count takes the same arguments as q and returns the number of groups in the query result as an ungrouped scalar.
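window(query string, duration string, period string, num scalar, funcName string) seriesSet
window is similar to band: it performs `num` queries of `duration` each, `period` apart, with the first query ending `period` ago. Each query result is then passed through `funcName`, which must be a reduction function that takes a single argument (that is, a function that reduces a series to a number), and the resulting numbers are assembled into a new series. For example, `window("avg:os.cpu{host=*}", "1h", "1d", 7, "dev")` returns a series of 7 values, each being the `dev` (standard deviation) of the metric over a one-hour block: from 1 day 1 hour ago to 1 day ago, from 2 days 1 hour ago to 2 days ago, and so on. The main difference from band is that window does not concatenate the series. It reduces each one to a number and builds a new series from those numbers.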
Reduction Functions
All reduction functions take a seriesSet and return a numberSet with one element per unique group.
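avg(seriesSet) numberSet
Returns the average (arithmetic mean) of each series.
cCount(seriesSet) numberSet
Returns the change count: the number of times a value in the series differs from the value immediately before it. Useful for checking whether something that should hold a steady value is changing. For example, a series with the values [0, 1, 0, 1] returns 3.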
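The change count can be sketched in a few lines of Python. The function name `c_count` is hypothetical and a series is simplified to a list of values in time order.

```python
def c_count(values):
    """Count how many values differ from the value immediately before them."""
    return sum(1 for prev, cur in zip(values, values[1:]) if cur != prev)


print(c_count([0, 1, 0, 1]))  # 3
print(c_count([5, 5, 5]))     # 0
```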
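dev(seriesSet) numberSet
Returns the standard deviation of each series (the square root of the averaged squared differences between each value and the mean).
diff(seriesSet) numberSet
Returns the last value of each series minus the first value.
first(seriesSet) numberSet
Returns the first data point of each series.
forecastlr(seriesSet, y_val numberSet|scalar) numberSet
Returns the number of seconds until a linear regression of each series reaches y_val.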
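The idea behind a linear-regression forecast can be sketched as follows: fit a least-squares line through the points and solve for the time at which the line reaches the target value. This is an illustration under stated assumptions, not Bosun's implementation. The function name `forecast_lr` and the explicit `now` parameter are hypothetical, the input needs at least two distinct timestamps, and how Bosun treats flat or already-crossed lines is not covered here.

```python
def forecast_lr(points, y_val, now):
    """Seconds from now until the least-squares line through points reaches y_val.

    points is a list of (timestamp, value) pairs with at least two distinct
    timestamps. Returns None when the fitted line is flat.
    """
    n = len(points)
    mean_t = sum(t for t, _ in points) / n
    mean_v = sum(v for _, v in points) / n
    denom = sum((t - mean_t) ** 2 for t, _ in points)
    slope = sum((t - mean_t) * (v - mean_v) for t, v in points) / denom
    if slope == 0:
        return None
    intercept = mean_v - slope * mean_t
    # Time at which the line hits y_val, expressed relative to now.
    return (y_val - intercept) / slope - now


# The value grows by 1 per second: 10 at t=0, 30 at t=20, so 100 is reached at t=90.
print(forecast_lr([(0, 10), (10, 20), (20, 30)], 100, now=20))  # 70.0
```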
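linelr(seriesSet, d Duration) seriesSet
Returns the linear regression line for the future, running from the current time to the current time plus the duration (an OpenTSDB duration string, see the OpenTSDB documentation). It adds the tag `regression=line` to the group. It is mainly meant for graphing with expressions, for example: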
$d = "1w"
$q = q("avg:1h-avg:os.disk.fs.percent_free{}{host=ny-tsdb*,disk=/mnt*}", "2w", "")
$line = linelr($q, "3n")
$m = merge($q, $line)
$m
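last(seriesSet) numberSet
Returns the last data point of each series.
len(seriesSet) numberSet
Returns the length of each series.
max(seriesSet) numberSet
Returns the maximum value of each series, the same as calling percentile(series, 1).
median(seriesSet) numberSet
Returns the median value of each series (the value closest to the middle), the same as calling percentile(series, .5).
min(seriesSet) numberSet
Returns the minimum value of each series, the same as calling percentile(series, 0).
percentile(seriesSet, p numberSet|scalar) numberSet
Returns the value of each series at percentile p, where p ranges from 0 to 1. The lowest and highest percentiles correspond to the minimum and maximum of the series.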
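A percentile lookup consistent with the description above can be sketched like this. It is one plausible definition (pick the sorted value at the rounded-up index), chosen for illustration only; the exact rule Bosun uses between the minimum and maximum is an assumption here.

```python
import math


def percentile(values, p):
    """Value at fraction p (0 to 1) of the sorted values; assumes a non-empty list."""
    ordered = sorted(values)
    if p <= 0:
        return ordered[0]    # same as min
    if p >= 1:
        return ordered[-1]   # same as max
    return ordered[math.ceil(p * (len(ordered) - 1))]


data = [3, 1, 2, 5, 4]
print(percentile(data, 0))    # 1
print(percentile(data, 0.5))  # 3
print(percentile(data, 1))    # 5
```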
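since(seriesSet) numberSet
Returns the number of seconds between now and the most recent data point of each series.
streak(seriesSet) numberSet
Returns, for each series, the length of the longest streak of contiguous values that evaluate to true (that is, non-zero values).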
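Assuming streak counts the longest run of contiguous non-zero values, it can be sketched as below. The function name is reused for readability and a series is simplified to a list of values in time order.

```python
def streak(values):
    """Length of the longest run of contiguous non-zero values."""
    longest = current = 0
    for v in values:
        current = current + 1 if v != 0 else 0
        longest = max(longest, current)
    return longest


print(streak([1, 1, 0, 1, 1, 1, 0]))  # 3
print(streak([0, 0]))                 # 0
```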
sum(seriesSet) numberSet
Returns the sum of the values of each series.
Group Functions
Group functions modify the OpenTSDB groups.
t(numberSet, group string) seriesSet
Transposes N series of length 1 to 1 series of length N. If the group parameter is not the empty string, the number of series returned is equal to the number of tagks passed. This is useful for performing scalar aggregation across multiple results from a query. For example, to get the total memory used on the web tier: `sum(t(avg(q("avg:os.mem.used{host=*-web*}", "5m", "")), ""))`.
How transpose works conceptually
Transpose Grouped results into a Single Result:
Before Transpose (Value Type is NumberSet):
Group | Value
---|---
{host=web01} | 1
{host=web02} | 7
{host=web03} | 4
After Transpose (Value Type is SeriesSet):
Group | Value
---|---
{} | 1,7,4
Transpose Grouped results into Multiple Results:
Before Transpose by host (Value Type is NumberSet):
Group | Value
---|---
{host=web01,disk=c} | 1
{host=web01,disk=d} | 3
{host=web02,disk=c} | 4
After Transpose by "host" (Value Type is SeriesSet):
Group | Value
---|---
{host=web01} | 1,3
{host=web02} | 4
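The regrouping shown in the tables above can be sketched in Python. This is a conceptual illustration only: the function name `transpose` is hypothetical, a numberSet is modeled as a list of (group, value) pairs, and each resulting series is shown as a plain list of values without timestamps.

```python
def transpose(number_set, keys):
    """Regroup (group, value) pairs by the given tag keys.

    Each new group keeps only the tag keys listed in keys, and the values of
    all members that share that reduced group are collected into one list.
    """
    result = {}
    for group, value in number_set:
        new_group = tuple((k, group[k]) for k in keys if k in group)
        result.setdefault(new_group, []).append(value)
    return result


# Transposing into a single result: like t(..., "").
single = transpose(
    [({"host": "web01"}, 1), ({"host": "web02"}, 7), ({"host": "web03"}, 4)],
    [],
)
print(single)  # one empty group holding the values 1, 7, 4

# Transposing by host: like t(..., "host").
by_host = transpose(
    [({"host": "web01", "disk": "c"}, 1),
     ({"host": "web01", "disk": "d"}, 3),
     ({"host": "web02", "disk": "c"}, 4)],
    ["host"],
)
print(by_host)  # web01 holds 1, 3 and web02 holds 4
```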
Useful Example of Transpose
Alert if more than 25% of the servers in a group have ping timeouts:
alert or_down {
    $group = host=or-*
    # bosun.ping.timeout is 0 for no timeout, 1 for timeout
    $timeout = q("sum:bosun.ping.timeout{$group}", "5m", "")
    # timeout will have multiple groups, such as or-web01,or-web02,or-web03.
    # each group has a series type (the observations in the past 5 minutes)
    # so we need to *reduce* each series values of each group into a single number:
    $max_timeout = max($timeout)
    # Max timeout is now a group of results where the value of each group is a number. Since each
    # group is an alert instance, we need to regroup this into a single alert. We can do that by
    # transposing with t()
    $max_timeout_series = t($max_timeout, "")
    # $max_timeout_series is now a single group with a value of type series. We need to reduce
    # that series into a single number in order to trigger an alert.
    $number_down_servers = sum($max_timeout_series)
    $total_servers = len($max_timeout_series)
    $percent_down = ($number_down_servers / $total_servers) * 100
    warn = $percent_down > 25
}
Since our templates can reference any variable in this alert, we can show which servers are down in the notification, even though the alert just triggers on 25% of or-* servers being down.
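ungroup(numberSet) scalar
Takes a numberSet that has a single group and returns its value as a scalar with the group removed. Used to combine queries from two differing groups.
Other Functions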
alert(name string, key string) numberSet
Executes and returns the `key` expression (which must be `warn` or `crit`) from alert `name`. Any alert of the same name that is unknown or unevaluated is also returned with a value of `1`. Primarily for use with `depends`.
Example: `alert("host.down", "crit")` returns the `crit` expression from the host.down alert.
abs(numberSet) numberSet
Returns a new numberSet containing the absolute value of each element of the input.
crop(series seriesSet, start numberSet, end numberSet) seriesSet
Returns a seriesSet where each series has datapoints removed if the datapoint is before start (from now, in seconds) or after end (also from now, in seconds). This is useful if you want to alert on different timespans for different items in a set, for example:
lookup test {
    entry host=ny-bosun01 {
        start = 30
    }
    entry host=* {
        start = 60
    }
}
alert test {
    template = test
    $q = q("avg:rate:os.cpu{host=ny-bosun*}", "5m", "")
    $c = crop($q, lookup("test", "start"), 0)
    crit = avg($c)
}
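d(string) scalar
Returns the number of seconds represented by an OpenTSDB duration string.
tod(scalar) string
Returns the OpenTSDB duration string that represents the given number of seconds. It is the inverse of `d()`, so you can do arithmetic on durations and pass the result to the duration arguments of functions such as `q()`.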
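The relationship between the two functions can be sketched in Python. This is a simplified model, not Bosun's parser: it assumes only the units `s`, `m`, `h`, `d`, and `w` (other OpenTSDB units are left out), and the output format chosen for `tod` (the largest unit that divides the value evenly) is an assumption for illustration.

```python
import re

# Seconds per unit for the subset of duration units handled by this sketch.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def d(duration: str) -> int:
    """Convert a duration string such as "5m" into a number of seconds."""
    match = re.fullmatch(r"(\d+)([smhdw])", duration)
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    return int(match.group(1)) * UNIT_SECONDS[match.group(2)]


def tod(seconds: int) -> str:
    """Convert seconds back into a duration string, using the largest unit that fits evenly."""
    for unit in ("w", "d", "h", "m"):
        size = UNIT_SECONDS[unit]
        if seconds >= size and seconds % size == 0:
            return f"{seconds // size}{unit}"
    return f"{seconds}s"


print(d("5m"))           # 300
print(tod(d("1h") * 2))  # 2h
print(tod(90))           # 90s
```

Doing the arithmetic in seconds and converting back is what makes expressions such as `tod(epoch()-$monthStart)` usable as a duration argument.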
des(series, alpha scalar, beta scalar) series
Returns series smoothed using Holt-Winters double exponential smoothing. Alpha (scalar) is the data smoothing factor. Beta (scalar) is the trend smoothing factor.
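Double exponential smoothing keeps two running estimates, a level and a trend, and updates both at every point. The sketch below uses the textbook recurrence with the first value as the initial level and the first difference as the initial trend. That initialization, and the simplification of a series to a list of values, are assumptions for illustration; Bosun's implementation may differ in those details.

```python
def des(values, alpha, beta):
    """Double exponential smoothing of a list of values.

    alpha is the data smoothing factor and beta is the trend smoothing factor,
    both between 0 and 1.
    """
    if len(values) < 2:
        return list(values)
    level = values[0]
    trend = values[1] - values[0]
    smoothed = [level]
    for x in values[1:]:
        prev_level = level
        # New level: blend the observation with the previous forecast.
        level = alpha * x + (1 - alpha) * (prev_level + trend)
        # New trend: blend the latest level change with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
        smoothed.append(level)
    return smoothed


# A perfectly linear series is reproduced unchanged.
print(des([1, 2, 3, 4], 0.5, 0.5))  # [1, 2.0, 3.0, 4.0]
```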
dropg(seriesSet, threshold numberSet|scalar) seriesSet
Remove any values greater than number from a series. Will error if this operation results in an empty series.
dropge(seriesSet, threshold numberSet|scalar) seriesSet
Remove any values greater than or equal to number from a series. Will error if this operation results in an empty series.
dropl(seriesSet, threshold numberSet|scalar) seriesSet
Remove any values lower than number from a series. Will error if this operation results in an empty series.
drople(seriesSet, threshold numberSet|scalar) seriesSet
Remove any values lower than or equal to number from a series. Will error if this operation results in an empty series.
dropna(seriesSet) seriesSet
Remove any NaN or Inf values from a series. Will error if this operation results in an empty series.
dropbool(seriesSet, seriesSet) seriesSet
Drop datapoints where the corresponding value in the second series set is non-zero. (See Series Operations for what corresponding means). The following example drops tr_avg (avg response time per bucket) datapoints if the count in that bucket was + or - 100 from the average count over the time period.
Example:
$count = q("sum:traffic.haproxy.route_tr_count{host=literal_or(ny-logsql01),route=Questions/Show}", "30m", "")
$avg = q("sum:traffic.haproxy.route_tr_avg{host=literal_or(ny-logsql01),route=Questions/Show}", "30m", "")
$avgCount = avg($count)
dropbool($avg, !($count < $avgCount-100 || $count > $avgCount+100))
epoch() scalar
Returns the Unix epoch in seconds of the expression start time (scalar).
filter(seriesSet, numberSet) seriesSet
Returns all results in seriesSet that are a subset of numberSet and have a non-zero value. Useful with the limit and sort functions to return the top X results of a query.
limit(numberSet, count scalar) numberSet
Returns the first count (scalar) results of number.
lookup(table string, key string) numberSet
Returns the first key from the given lookup table with matching tags, this searches the built-in index and so only makes sense when using OpenTSDB and sending data to /index or relaying through bosun.
lookupSeries(series seriesSet, table string, key string) numberSet
Returns the first key from the given lookup table with matching tags. The first argument is a series to use from which to derive the tag information. This is good for alternative storage backends such as graphite and influxdb.
map(series seriesSet, subExpr numberSetExpr) seriesSet
map applies the subExpr to each value in each series in the set. A special function `v()`, which is only available in a numberSetExpr, gives you the value of each item in the series.
For example, you can do something like the following to get the absolute value of each item in the series (since the normal `abs()` function works on normal numbers, not series):
$q = q("avg:rate:os.cpu{host=*bosun*}", "5m", "")
map($q, expr(abs(v())))
Or for another example, this would get you the absolute difference of each datapoint from the series average as a new series:
$q = q("avg:rate:os.cpu{host=*bosun*}", "5m", "")
map($q, expr(abs(v()-avg($q))))
Since this function is not optimized for a particular operation on a seriesSet, it may not be very efficient. If you find you are doing things that involve more complex expressions within the `expr(...)` inside map (for example, having query functions in there), then you may want to consider requesting a new function to be added to bosun's DSL.
expr(expression)
expr takes an expression and returns either a numberSetExpr or a seriesSetExpr depending on the resulting type of the inner expression. This exists for functions like `map`; it is currently not valid in the expression language outside of function arguments.
month(offset scalar, startEnd string) scalar
Returns the epoch of either the start or end of the month. Offset is the timezone offset from UTC that the month starts/ends at (but the returned epoch is representative of UTC). startEnd must be either `"start"` or `"end"`. Useful for things like monthly billing, for example:
$hostInt = host=ny-nexus01,iname=Ethernet1/46
$inMetric = "sum:5m-avg:rate{counter,,1}:__ny-nexus01.os.net.bytes{$hostInt,direction=in}"
$outMetric = "sum:5m-avg:rate{counter,,1}:__ny-nexus01.os.net.bytes{$hostInt,direction=out}"
$commit = 100
$monthStart = month(-4, "start")
$monthEnd = month(-4, "end")
$monthLength = $monthEnd - $monthStart
$burstTime = ($monthLength)*.05
$burstableObservations = $burstTime / d("5m")
$in = q($inMetric, tod(epoch()-$monthStart), "") * 8 / 1e6
$out = q($outMetric, tod(epoch()-$monthStart), "") * 8 / 1e6
$inOverCount = sum($in > $commit)
$outOverCount = sum($out > $commit)
$inOverCount > $burstableObservations || $outOverCount > $burstableObservations
series(tagset string, epoch, value, ...) seriesSet
Returns a seriesSet with one series. The series will have a group (a.k.a. tagset). The tagset can be "" for the empty group, or in "key=value,key=value" format. You can then optionally pass epoch-value pairs (if none are provided, the series will be empty). This can be used for testing or for drawing arbitrary lines. For example:
$now = epoch()
$hourAgo = $now-d("1h")
merge(series("foo=bar", $hourAgo, 5, $now, 10), series("foo=bar2", $hourAgo, 6, $now, 11))
shift(seriesSet, dur string) seriesSet
Shift takes a seriesSet and shifts the time forward by the value of dur (OpenTSDB duration string) and adds a tag for representing the shift duration. This is meant so you can overlay times visually in a graph.
leftjoin(tagsCSV string, dataCSV string, ...numberSet) table
leftjoin takes multiple numberSets and joins them to the first numberSet to form a table. tagsCSV is a string that is comma delimited, and should match tags from query that you want to display (i.e., "host,disk"). dataCSV is a list of column names for each numberset, so it should have the same number of labels as there are numberSets.
The only current intended use case is for constructing "Table" panels in Grafana.
For example, the following in Grafana would create a table that shows the CPU of each host for the current period, the CPU for the adjacent previous period, and the difference between them:
$cpuMetric = "avg:$ds-avg:rate{counter,,1}:os.cpu{host=*bosun*}{}"
$currentCPU = avg(q($cpuMetric, "$start", ""))
$span = (epoch() - (epoch() - d("$start")))
$previousCPU = avg(q($cpuMetric, tod($span*2), "$start"))
$delta = $currentCPU - $previousCPU
leftjoin("host", "Current CPU,Previous CPU,Change", $currentCPU, $previousCPU, $delta)
Note that the above example is intended to be used in Grafana via the Bosun datasource, so `$start` and `$ds` are replaced by Grafana before the query is sent to Bosun.
merge(SeriesSet...) seriesSet
Merge takes multiple seriesSets and merges them into a single seriesSet. The function will error if any of the tag sets (groups) are identical. This is meant so you can display multiple seriesSets in a single expression graph.
nv(numberSet, scalar) numberSet
Change the NaN value during binary operations (when joining two queries) of unknown groups to the scalar. This is useful to prevent unknown group and other errors from bubbling up.
sort(numberSet, (asc|desc) string) numberSet
Returns the results sorted by value in ascending ("asc") or descending ("desc") order. Results are first sorted by groupname and then stably sorted so that results with identical values are always in the same order.
timedelta(seriesSet) seriesSet
Returns the difference between successive timestamps in a series. For example:
timedelta(series("foo=bar", 1466133600, 1, 1466133610, 1, 1466133710, 1))
Would return a seriesSet equal to:
series("foo=bar", 1466133610, 10, 1466133710, 100)
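The example above can be reproduced with a short Python sketch. A series is modeled as a dict from timestamp to value, which is a simplification, and the function name simply mirrors the expression function for readability.

```python
def timedelta(series):
    """Map each timestamp (except the first) to its distance from the previous timestamp."""
    stamps = sorted(series)
    return {cur: cur - prev for prev, cur in zip(stamps, stamps[1:])}


result = timedelta({1466133600: 1, 1466133610: 1, 1466133710: 1})
print(result)  # {1466133610: 10, 1466133710: 100}
```

The first data point has no predecessor, so the result has one point fewer than the input.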